Data Science Project on Amazon Best Selling Books Analysis with the Python Programming Language¶

About the dataset: This dataset contains information about top selling books of Amazon(550 books) either belonging to fiction or non-fiction.

The data I'll be using in this data science project is a dataset of Amazon's 50 Best Books between 2009 and 2019.

Task 1: Reading and Inspection¶

### Subtask 1.1: Import and Read

Import and Read the Amazon Best Selling Book dataset and Store it in a variable called data.

Importing Libraries¶

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Loading the Dataset¶

data = pd.read_csv("bestsellers with categories.csv")
data.head()

Subtask 1.2: Inspect the dataframe¶

Inspect the dataframe's columns, shapes, variable types, unique values etc.

Check Shape¶

r,c = data.shape
print(f"The dataset has {r} rows and {c} columns.")

The dataset has 550 rows and 7 columns.

Check Datatype of Each Column¶

data.dtypes

Name            object
Author          object
User Rating    float64
Reviews          int64
Price            int64
Year             int64
Genre           object
dtype: object

Detail Information about Dataset¶

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550 entries, 0 to 549
Data columns (total 7 columns):
Name           550 non-null object
Author         550 non-null object
User Rating    550 non-null float64
Reviews        550 non-null int64
Price          550 non-null int64
Year           550 non-null int64
Genre          550 non-null object
dtypes: float64(1), int64(3), object(3)
memory usage: 30.2+ KB

Check The Total Unique Values In Each Column¶

data.nunique()

Name           351
Author         248
User Rating     14
Reviews        346
Price           40
Year            11
Genre            2
dtype: int64

Get An Overview Of The Dataframe¶

data.describe()

Note: It is clear from the above table that maximum price of a book is 105$ and maximum rating is 4.9.

Task 2: Cleaning the Data¶

### Subtask 2.1: Inspect Null values

Find out the number of Null values in all the columns and rows

Column-Wise Null Count¶

data.isnull().sum()

Name           0
Author         0
User Rating    0
Reviews        0
Price          0
Year           0
Genre          0
dtype: int64

Row-Wise Null Count¶

data.isnull().sum(axis=1).sort_values(ascending=False)

549    0
180    0
186    0
185    0
184    0
      ..
372    0
373    0
374    0
375    0
0      0
Length: 550, dtype: int64

Subtask 2.2: Check if the dataset has some duplicate values?¶

data[data.duplicated()]

len(data[data.duplicated()])

0

Note: There is no duplicate row found in the dataset

Check Total number of different books¶

len(data.Name.unique())

351

Note: Total number of unique books are 351 but total rows are 550.
This is clear if we see the dataframe that some books are repeatedly among the top sellers in different years.

Check The Duplicate Books¶

data[['Name','Author']].duplicated().sum()

199

Task 3: Data Analysis¶

### Subtask 3.1: Find the Book with highest profit
1. Create a new column called Profit which contains the product of the two columns: Reviews and Price.
2. Sort the dataframe using the profit column as reference.
3. Extract the top ten profiting Books in descending order and store them in a new dataframe - top10

data['Profit'] = data['Reviews'] * data['Price']
data

Top10 = data.sort_values(by='Profit',ascending=False).head(10)
Top10

### Subtask 3.2: Drop duplicate values

After you found out the top 10 profiting books, you might have notice a duplicate value. So, it seems like the dataframe has duplicate values as well. Group the dataframe by name and repeat Subtask 3.1.

Top10 = data.groupby('Name')[['Profit']].max().sort_values(by='Profit',ascending=False).head(10)
Top10

sns.barplot(x=Top10.Profit,y=Top10.index.values)
plt.xlabel("Earned(millions)")
plt.ylabel("Books")
plt.title('Earning by Books')

Text(0.5, 1.0, 'Earning by Books')

### Subtask 3.3: Which Genre Has The Most Books In This Category and Their Distribution?

data.Genre.unique()

array(['Non Fiction', 'Fiction'], dtype=object)

data.Genre.value_counts()

Non Fiction    310
Fiction        240
Name: Genre, dtype: int64

sns.countplot(x="Genre", data=data)
plt.show()

plt.pie(data.Genre.value_counts(),labels=['Non Fiction','Fiction'],autopct='%.0f%%');

Non Fiction books are in majority in top selling category.

### Subtask 3.4: What is The Average Rating of Each Genre and Plot Histogtam of each Rating

data.groupby('Genre')['User Rating'].mean()

Genre
Fiction        4.648333
Non Fiction    4.595161
Name: User Rating, dtype: float64

sns.histplot(data=data['User Rating'],bins=10)
plt.xlabel("Ratings")

Text(0.5, 0, 'Ratings')

### Subtask 3.5: Find the Relationship of Ratings with Time.

sns.lineplot(y=data['User Rating'],x=data['Year'],hue=data['Genre']);
plt.ylabel("Ratings")
plt.xlabel("Years");

It is clear from the above graph that most of the books received ratings between 4.5 to 4.9.

### Subtask 3.6: Find Relationship Between Ratings and Price

Plot Data and Regression Model Fits Across a FacetGrid.¶

sns.lmplot(y='User Rating',x='Price',data=data)
plt.ylabel('Ratings')
plt.xlabel('Price');

Note: As we seen price increases, rating goes down.

data.groupby('Genre')['User Rating'].mean()

Genre
Fiction        4.648333
Non Fiction    4.595161
Name: User Rating, dtype: float64

This graph shows that there is not a significant relationship between price and ratings but with increasing price , the ratings are falling for both Fiction and Non fiction.

### Subtask 3.6: Find How Much Books Have Earned Yearly.

data.groupby('Year').Profit.sum().sort_values(ascending=False)

Year
2014    10625500
2012     8929419
2019     8336955
2013     8321579
2016     7951530
2015     7745165
2018     7183575
2017     6669195
2011     5548689
2010     3620509
2009     3567282
Name: Profit, dtype: int64

data.groupby('Year').Profit.sum().plot(kind='bar')
plt.show()

data.groupby(['Year','Genre']).Profit.sum().sort_values(ascending=False)

Year  Genre      
2014  Fiction        6858148
2013  Fiction        5444489
2012  Fiction        5098394
2018  Non Fiction    4713219
2019  Non Fiction    4500705
2015  Fiction        4364798
2016  Fiction        4064705
      Non Fiction    3886825
2019  Fiction        3836250
2012  Non Fiction    3831025
2014  Non Fiction    3767352
2017  Non Fiction    3568679
2015  Non Fiction    3380367
2017  Fiction        3100516
2011  Non Fiction    3082561
2013  Non Fiction    2877090
2018  Fiction        2470356
2011  Fiction        2466128
2009  Fiction        2058643
2010  Non Fiction    1930069
      Fiction        1690440
2009  Non Fiction    1508639
Name: Profit, dtype: int64

Plot using Seaborn¶

sns.barplot(x=data['Year'],y=data['Profit'],hue=data['Genre'])
plt.xlabel('Years')
plt.ylabel("Eearned(millions)")
plt.title('Money earned each year');

Above graph above shows the earning of these top books ,in most cases, are below 0.8 million.

### Subtask 3.7: Average Profit Earned by Each Book Depending on its Genre.

genre_average=data.groupby(['Genre'])['Profit'].mean()
genre_average

Genre
Fiction        172720.279167
Non Fiction    119504.938710
Name: Profit, dtype: float64

sns.barplot(x=genre_average.index,y=genre_average);

### Subtask 3.8:. Find Books Which Earned The Most Per Year (2009-19)

data.groupby('Year')['Profit'].max()

Year
2009     394680
2010     474768
2011     508470
2012     661710
2013     701295
2014    1396161
2015    1430028
2016     700492
2017     458730
2018     672463
2019    1317615
Name: Profit, dtype: int64

data.groupby('Year')['Profit'].transform(max)

0       700492
1       508470
2       672463
3       458730
4      1317615
        ...   
545    1317615
546     700492
547     458730
548     672463
549    1317615
Name: Profit, Length: 550, dtype: int64

data[data.groupby('Year')['Profit'].transform(max) == data['Profit']]

most_earning_book_per_year=data[data.groupby('Year')['Profit'].transform(max) == data['Profit']]
most_earning_book_per_year

most_earning_book_per_year=most_earning_book_per_year.sort_values('Year').set_index('Year')
most_earning_book_per_year

### Subtask 3.9: What is the Average Earning of Each Genre on per Year Basis

genres_per_year_mean=data.groupby(['Year','Genre'])[['Profit']].mean().round(2)
genres_per_year_mean

### Subtask 3.10: Trend of books selling (Top rated) over the years

Earning_Graph=data.groupby('Year')['Profit'].sum()
Earning_Graph

Year
2009     3567282
2010     3620509
2011     5548689
2012     8929419
2013     8321579
2014    10625500
2015     7745165
2016     7951530
2017     6669195
2018     7183575
2019     8336955
Name: Profit, dtype: int64

sns.lineplot(data=Earning_Graph)
plt.xlabel('Year')
plt.ylabel("Earned")
plt.title("EARNING PER YEAR")
plt.figure(figsize=(12,12));

<Figure size 864x864 with 0 Axes>

### Subtask 3.11: Top 10 authors which earned the most

authors=data.groupby('Author')['Profit'].sum().sort_values(ascending=False).head(10)
authors

Author
American Psychological Association    3946800
Suzanne Collins                       3368646
E L James                             2517303
John Green                            2381609
Laura Hillenbrand                     2284821
Paula Hawkins                         1986150
Gillian Flynn                         1660859
Gary Chapman                          1516167
Dr. Seuss                             1423598
American Psychiatric Association      1402590
Name: Profit, dtype: int64

sns.barplot(y=authors.index,x=authors)
plt.title('The Money Makers ')

Text(0.5, 1.0, 'The Money Makers ')

	User Rating	Reviews	Price	Year
count	550.000000	550.000000	550.000000	550.000000
mean	4.618364	11953.281818	13.100000	2014.000000
std	0.226980	11731.132017	10.842262	3.165156
min	3.300000	37.000000	0.000000	2009.000000
25%	4.500000	4058.000000	7.000000	2011.000000
50%	4.700000	8580.000000	11.000000	2014.000000
75%	4.800000	17253.250000	16.000000	2017.000000
max	4.900000	87841.000000	105.000000	2019.000000

	Name	Author	User Rating	Reviews	Price	Year	Genre	Profit
0	10-Day Green Smoothie Cleanse	JJ Smith	4.7	17350	8	2016	Non Fiction	138800
1	11/22/63: A Novel	Stephen King	4.6	2052	22	2011	Fiction	45144
2	12 Rules for Life: An Antidote to Chaos	Jordan B. Peterson	4.7	18979	15	2018	Non Fiction	284685
3	1984 (Signet Classics)	George Orwell	4.7	21424	6	2017	Fiction	128544
4	5,000 Awesome Facts (About Everything!) (Natio...	National Geographic Kids	4.8	7665	12	2019	Non Fiction	91980
...	...	...	...	...	...	...	...	...
545	Wrecking Ball (Diary of a Wimpy Kid Book 14)	Jeff Kinney	4.9	9413	8	2019	Fiction	75304
546	You Are a Badass: How to Stop Doubting Your Gr...	Jen Sincero	4.7	14331	8	2016	Non Fiction	114648
547	You Are a Badass: How to Stop Doubting Your Gr...	Jen Sincero	4.7	14331	8	2017	Non Fiction	114648
548	You Are a Badass: How to Stop Doubting Your Gr...	Jen Sincero	4.7	14331	8	2018	Non Fiction	114648
549	You Are a Badass: How to Stop Doubting Your Gr...	Jen Sincero	4.7	14331	8	2019	Non Fiction	114648

	Profit
Name
The Girl on the Train	1430028
The Alchemist	1396161
Where the Crawdads Sing	1317615
Diagnostic and Statistical Manual of Mental Disorders, 5th Edition: DSM-5	701295
Harry Potter Paperback Box Set (Books 1-7)	700492
The Goldfinch: A Novel (Pulitzer Prize for Fiction)	676880
Becoming	672463
Fifty Shades of Grey: Book One of the Fifty Shades Trilogy (Fifty Shades of Grey Series)	661710
The Fault in Our Stars	656266
A Game of Thrones / A Clash of Kings / A Storm of Swords / A Feast of Crows / A Dance with Dragons	592050

	Name	Author	User Rating	Reviews	Price	Year	Genre	Profit
382	The Girl on the Train	Paula Hawkins	4.1	79446	18	2015	Fiction	1430028
338	The Alchemist	Paulo Coelho	4.7	35799	39	2014	Fiction	1396161
534	Where the Crawdads Sing	Delia Owens	4.8	87841	15	2019	Fiction	1317615
70	Diagnostic and Statistical Manual of Mental Di...	American Psychiatric Association	4.5	6679	105	2014	Non Fiction	701295
69	Diagnostic and Statistical Manual of Mental Di...	American Psychiatric Association	4.5	6679	105	2013	Non Fiction	701295
159	Harry Potter Paperback Box Set (Books 1-7)	J. K. Rowling	4.8	13471	52	2016	Fiction	700492
393	The Goldfinch: A Novel (Pulitzer Prize for Fic...	Donna Tartt	3.9	33844	20	2014	Fiction	676880
392	The Goldfinch: A Novel (Pulitzer Prize for Fic...	Donna Tartt	3.9	33844	20	2013	Fiction	676880
33	Becoming	Michelle Obama	4.8	61133	11	2019	Non Fiction	672463
32	Becoming	Michelle Obama	4.8	61133	11	2018	Non Fiction	672463

	Name	Author	User Rating	Reviews	Price	Genre	Profit
Year
2009	Publication Manual of the American Psychologic...	American Psychological Association	4.5	8580	46	Non Fiction	394680
2010	Unbroken: A World War II Story of Survival, Re...	Laura Hillenbrand	4.8	29673	16	Non Fiction	474768
2011	The Hunger Games Trilogy Boxed Set (1)	Suzanne Collins	4.8	16949	30	Fiction	508470
2012	Fifty Shades of Grey: Book One of the Fifty Sh...	E L James	3.8	47265	14	Fiction	661710
2013	Diagnostic and Statistical Manual of Mental Di...	American Psychiatric Association	4.5	6679	105	Non Fiction	701295
2014	The Alchemist	Paulo Coelho	4.7	35799	39	Fiction	1396161
2015	The Girl on the Train	Paula Hawkins	4.1	79446	18	Fiction	1430028
2016	Harry Potter Paperback Box Set (Books 1-7)	J. K. Rowling	4.8	13471	52	Fiction	700492
2017	Player's Handbook (Dungeons & Dragons)	Wizards RPG Team	4.8	16990	27	Fiction	458730
2018	Becoming	Michelle Obama	4.8	61133	11	Non Fiction	672463
2019	Where the Crawdads Sing	Delia Owens	4.8	87841	15	Fiction	1317615